Chapter 1: Introduction to Open Data Science


In this course, we will learn to take advantage of the basic tools of open data science.


My first thoughts when I found this course:

I have been looking for something like this.

DataCamp also seems like a nice learning platform because R is used through the browser and you get immediate feedback.

My name is Janne Mikkonen. My GitHub repository can be found here.


Chapter 2: Regression and model validation



Pre-examination

I start by reading the “learning2014” data file that I have created before and exploring its dimensions and structure.

setwd("~/Documents/GitHub/IODS-project/data")
learning2014 <- read.csv(file = "learning2014.csv")
dim(learning2014)
## [1] 166   7
str(learning2014)
## 'data.frame':    166 obs. of  7 variables:
##  $ gender  : Factor w/ 2 levels "F","M": 1 2 1 2 2 1 2 1 2 1 ...
##  $ age     : int  53 55 49 53 49 38 50 37 37 42 ...
##  $ attitude: num  3.7 3.1 2.5 3.5 3.7 3.8 3.5 2.9 3.8 2.1 ...
##  $ deep    : num  3.58 2.92 3.5 3.5 3.67 ...
##  $ stra    : num  3.38 2.75 3.62 3.12 3.62 ...
##  $ surf    : num  2.58 3.17 2.25 2.25 2.83 ...
##  $ points  : int  25 12 24 10 22 21 21 31 24 26 ...

This data set includes 166 observations and 7 variables: gender, age, attitude, deep, stra, surf and points. Gender is a factor variable that separates women and men. Age is an integer variable measured in years. The variables from attitude to surf are averages of several combined variables. Attitude measures global attitude towards statistics. Deep measures the dimension of deep learning, stra strategic learning and surf surface learning. Points indicates the total number of exam points.

I continue by exploring the data graphically. We can plot an overview of all variables in the data with the help of GGally and ggplot2 packages.

library(GGally)
library(ggplot2)

p <- ggpairs(learning2014, mapping = aes(col = gender, alpha = 0.3), lower = list(combo = wrap("facethist", bins = 20)))
p

There are more women than men in the data and most participants are in their 20s. Exam points have a rather strong positive correlation with attitude. The dimensions of deep learning, strategic learning and surface learning don’t have strong positive correlations with each other, which is a good thing because they should measure separate dimensions. Deep learning and surface learning have a relatively strong negative correlation, as could be expected. Attitude and the learning dimensions have almost normal distributions, whereas for exam points the lowest end of the scale is slightly underrepresented.

We can also look at the association between attitude and exam points by drawing a scatter plot. I will include separate regression lines for women and men to evaluate whether the association differs by gender.

p1 <- ggplot(learning2014, aes(x = attitude, y = points, col = gender))
p2 <- p1 + geom_point()
p3 <- p2 + geom_smooth(method = "lm")
p4 <- p3 + ggtitle("Student's attitude versus exam points by gender")
p4

There seems to be a clear association between attitude and points as could already be seen from the correlations. There is not much difference between women and men.

Estimation

I will continue by estimating a multiple regression model for exam points with three explanatory variables: attitude, strategic learning and surface learning. These variables had the largest correlations with exam points.

model1 <- lm(points ~ attitude + stra + surf, data = learning2014)
summary(model1)
## 
## Call:
## lm(formula = points ~ attitude + stra + surf, data = learning2014)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -17.1550  -3.4346   0.5156   3.6401  10.8952 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  11.0171     3.6837   2.991  0.00322 ** 
## attitude      3.3952     0.5741   5.913 1.93e-08 ***
## stra          0.8531     0.5416   1.575  0.11716    
## surf         -0.5861     0.8014  -0.731  0.46563    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 5.296 on 162 degrees of freedom
## Multiple R-squared:  0.2074, Adjusted R-squared:  0.1927 
## F-statistic: 14.13 on 3 and 162 DF,  p-value: 3.156e-08

Attitude was the only variable to have a clear and statistically significant (at the conventional p<0.05 level) association with exam points: a one-point increase in attitude is associated with an average increase of 3.4 exam points. The p-value of p<0.001 tells us that there is a very small likelihood of observing an association this large if the null hypothesis is true (i.e. the relationship is zero in the population). Since the dimensions of strategic and surface learning showed no associations, I refit the model excluding them.

model2 <- lm(points ~ attitude, data = learning2014)
summary(model2)
## 
## Call:
## lm(formula = points ~ attitude, data = learning2014)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -16.9763  -3.2119   0.4339   4.1534  10.6645 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  11.6372     1.8303   6.358 1.95e-09 ***
## attitude      3.5255     0.5674   6.214 4.12e-09 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 5.32 on 164 degrees of freedom
## Multiple R-squared:  0.1906, Adjusted R-squared:  0.1856 
## F-statistic: 38.61 on 1 and 164 DF,  p-value: 4.119e-09

The coefficient of attitude remains about the same as before. An increase of one point in attitude is associated with an average increase of 3.5 exam points. The intercept of the model tells us that if attitude is zero, the estimated exam score is 11.6 points. Based on the multiple R-squared, we can conclude that attitude towards statistics explains around 19% of the variation in exam score.
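With a single predictor, the multiple R-squared is simply the squared correlation between the predictor and the outcome. A minimal sketch on simulated data (not the course data) illustrating the relation:

```r
# with one predictor, R-squared equals the squared correlation
set.seed(1)
sim_attitude <- runif(100, min = 1, max = 5)
sim_points   <- 10 + 3.5 * sim_attitude + rnorm(100, sd = 5)
fit <- lm(sim_points ~ sim_attitude)
all.equal(summary(fit)$r.squared, cor(sim_attitude, sim_points)^2)
## [1] TRUE
```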

Model validation

I will next explore the assumptions of the linear regression model by plotting the residuals of model2 with three diagnostic plots.

par(mfrow = c(2,2))
plot(model2, which = c(1, 2, 5))

The first graph, “Residuals vs Fitted”, can be used to explore the assumption that the size of the errors does not depend on the values of the explanatory variables. Here, the residuals (dots) are plotted against the fitted values of the model. The variance of the residuals seems mostly constant across the fitted values, although it becomes slightly smaller at the largest fitted values. Moreover, there are a few cases with unexpectedly low exam scores when the fitted value is between 24 and 26.

The Q-Q plot is used for exploring the assumption that the errors are normally distributed. If the points follow the line drawn in the graph, we can conclude that the errors are normally distributed. In this case, there seems to be some small violation in the extremes of the distribution, but overall the assumption holds reasonably well.

The “Residuals vs Leverage” plot helps to identify which observations have an unusually high impact on the model. This plot shows leverage, which indicates how much an observation’s value on the predictor variable differs from its mean, and Cook’s distance, which is a measure of how much the exclusion of that observation would change the model results. Problematic observations have both high leverage and high Cook’s distance. In our case, the dashed Cook’s distance curves are not even visible in the plot, so there seems to be no problem with outliers.
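The quantities behind these plots can also be extracted numerically with hatvalues() and cooks.distance(); a small sketch on simulated data (not the course data), using the common 4/n rule of thumb as an assumption for flagging influential observations:

```r
# leverage and Cook's distance for each observation of a toy model
set.seed(2)
x <- rnorm(50)
y <- 2 + 3 * x + rnorm(50)
fit <- lm(y ~ x)
lev <- hatvalues(fit)       # leverage of each observation
cd  <- cooks.distance(fit)  # influence of each observation
# a common rule of thumb flags observations with distance > 4/n
flagged <- which(cd > 4 / length(cd))
```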

To conclude, we only found evidence for an association between attitude towards statistics and exam score, whereas strategic and surface learning did not seem to predict performance in the exam.


Chapter 3: Logistic regression



Reading the data

I start by reading the “alc” data file that I created in the data wrangling exercise. This data file is a combination of two data sets measuring student performance in (1) mathematics and (2) Portuguese language in two Portuguese secondary-level schools.

setwd("~/Documents/GitHub/IODS-project/data")
alc <- read.csv(file = "alc.csv")

After loading the data, I continue by exploring its structure and dimensions.

str(alc)
## 'data.frame':    382 obs. of  35 variables:
##  $ school    : Factor w/ 2 levels "GP","MS": 1 1 1 1 1 1 1 1 1 1 ...
##  $ sex       : Factor w/ 2 levels "F","M": 1 1 1 1 1 2 2 1 2 2 ...
##  $ age       : int  18 17 15 15 16 16 16 17 15 15 ...
##  $ address   : Factor w/ 2 levels "R","U": 2 2 2 2 2 2 2 2 2 2 ...
##  $ famsize   : Factor w/ 2 levels "GT3","LE3": 1 1 2 1 1 2 2 1 2 1 ...
##  $ Pstatus   : Factor w/ 2 levels "A","T": 1 2 2 2 2 2 2 1 1 2 ...
##  $ Medu      : int  4 1 1 4 3 4 2 4 3 3 ...
##  $ Fedu      : int  4 1 1 2 3 3 2 4 2 4 ...
##  $ Mjob      : Factor w/ 5 levels "at_home","health",..: 1 1 1 2 3 4 3 3 4 3 ...
##  $ Fjob      : Factor w/ 5 levels "at_home","health",..: 5 3 3 4 3 3 3 5 3 3 ...
##  $ reason    : Factor w/ 4 levels "course","home",..: 1 1 3 2 2 4 2 2 2 2 ...
##  $ nursery   : Factor w/ 2 levels "no","yes": 2 1 2 2 2 2 2 2 2 2 ...
##  $ internet  : Factor w/ 2 levels "no","yes": 1 2 2 2 1 2 2 1 2 2 ...
##  $ guardian  : Factor w/ 3 levels "father","mother",..: 2 1 2 2 1 2 2 2 2 2 ...
##  $ traveltime: int  2 1 1 1 1 1 1 2 1 1 ...
##  $ studytime : int  2 2 2 3 2 2 2 2 2 2 ...
##  $ failures  : int  0 0 2 0 0 0 0 0 0 0 ...
##  $ schoolsup : Factor w/ 2 levels "no","yes": 2 1 2 1 1 1 1 2 1 1 ...
##  $ famsup    : Factor w/ 2 levels "no","yes": 1 2 1 2 2 2 1 2 2 2 ...
##  $ paid      : Factor w/ 2 levels "no","yes": 1 1 2 2 2 2 1 1 2 2 ...
##  $ activities: Factor w/ 2 levels "no","yes": 1 1 1 2 1 2 1 1 1 2 ...
##  $ higher    : Factor w/ 2 levels "no","yes": 2 2 2 2 2 2 2 2 2 2 ...
##  $ romantic  : Factor w/ 2 levels "no","yes": 1 1 1 2 1 1 1 1 1 1 ...
##  $ famrel    : int  4 5 4 3 4 5 4 4 4 5 ...
##  $ freetime  : int  3 3 3 2 3 4 4 1 2 5 ...
##  $ goout     : int  4 3 2 2 2 2 4 4 2 1 ...
##  $ Dalc      : int  1 1 2 1 1 1 1 1 1 1 ...
##  $ Walc      : int  1 1 3 1 2 2 1 1 1 1 ...
##  $ health    : int  3 3 3 5 5 5 3 1 1 5 ...
##  $ absences  : int  5 3 8 1 2 8 0 4 0 0 ...
##  $ G1        : int  2 7 10 14 8 14 12 8 16 13 ...
##  $ G2        : int  8 8 10 14 12 14 12 9 17 14 ...
##  $ G3        : int  8 8 11 14 12 14 12 10 18 14 ...
##  $ alc_use   : num  1 1 2.5 1 1.5 1.5 1 1 1 1 ...
##  $ high_use  : logi  FALSE FALSE TRUE FALSE FALSE FALSE ...
dim(alc)
## [1] 382  35

Our data includes 35 variables (columns) and 382 observations (rows). When creating the combined analysis data, we made the following adjustments:

  • The variables not used for joining the two data sets have been combined by averaging (including the grade variables)
  • ‘alc_use’ is the average of alcohol consumption on workdays (Dalc) and on weekends (Walc)
  • The logical variable ‘high_use’ is TRUE if ‘alc_use’ is higher than 2 and FALSE otherwise
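The last two derived variables can be sketched on a toy data frame (hypothetical values, not the actual joined data):

```r
library(dplyr)

# toy data: weekday (Dalc) and weekend (Walc) alcohol consumption
d <- data.frame(Dalc = c(1, 3, 5), Walc = c(2, 4, 1))
d <- mutate(d,
            alc_use  = (Dalc + Walc) / 2,  # average of the two scales
            high_use = alc_use > 2)        # TRUE when the average exceeds 2
d$high_use
## [1] FALSE  TRUE  TRUE
```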

Hypotheses

In this exercise I will examine the effect of sex, urban/rural residence, going out with friends and attendance at extracurricular activities on the probability of high alcohol use. I hypothesize that

  • Boys are more likely to consume high amounts of alcohol than are girls.
  • Those living in an urban area are more likely to consume high amounts of alcohol than those living in rural areas due to better access to alcohol.
  • Those going out with friends often are more likely to use high amounts of alcohol than those going out less frequently because alcohol is often used with friends.
  • Those attending extracurricular activities are less likely to consume high amounts of alcohol because these students are more study-oriented and thus have other things to do.

Data exploration

Next, I will explore the distributions of the outcome variable and the four chosen explanatory variables. I create a new data frame “alc5” that only includes these five variables. I start by exploring the distributions graphically.

library(tidyr); library(dplyr); library(ggplot2)
## 
## Attaching package: 'dplyr'
## The following object is masked from 'package:GGally':
## 
##     nasa
## The following objects are masked from 'package:stats':
## 
##     filter, lag
## The following objects are masked from 'package:base':
## 
##     intersect, setdiff, setequal, union
alc5 <- select(alc, sex, address, goout, activities, high_use)
gather(alc5) %>% ggplot(aes(value)) + facet_wrap("key", scales = "free") + geom_bar()
## Warning: attributes are not identical across measure variables;
## they will be dropped

Around half of the students attend extracurricular activities. Living in an urban area is more common than living in a rural area, and the extremes of going out with friends are less common than the middle of the scale. Girls are only slightly overrepresented in the sample. Around one third of the sample engages in high alcohol use.

I continue by looking at the relative frequencies of low/high alcohol use by the four explanatory variables.

table(alc5$activities, alc5$high_use) %>% prop.table(1) %>% round(2)
##      
##       FALSE TRUE
##   no   0.67 0.33
##   yes  0.73 0.27
table(alc5$address, alc5$high_use) %>% prop.table(1) %>% round(2)
##    
##     FALSE TRUE
##   R  0.62 0.38
##   U  0.72 0.28
table(alc5$goout, alc5$high_use) %>% prop.table(1) %>% round(2)
##    
##     FALSE TRUE
##   1  0.86 0.14
##   2  0.84 0.16
##   3  0.82 0.18
##   4  0.51 0.49
##   5  0.40 0.60
table(alc5$sex, alc5$high_use) %>% prop.table(1) %>% round(2)
##    
##     FALSE TRUE
##   F  0.79 0.21
##   M  0.61 0.39

High alcohol consumption is slightly less common among those attending extracurricular activities, as hypothesized. Contrary to my hypothesis, consuming high amounts of alcohol is less common among those living in urban areas. In line with my hypotheses, high alcohol use is more common among students who go out with friends often and among male students.

Logistic regression

I move on to estimating a logistic regression model with high alcohol use as the outcome variable and activities, address, going out with friends and sex as explanatory variables.

m <- glm(high_use ~ activities + address + goout + sex, data = alc5, family = "binomial")
summary(m)
## 
## Call:
## glm(formula = high_use ~ activities + address + goout + sex, 
##     family = "binomial", data = alc5)
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -1.7579  -0.7750  -0.5065   0.7677   2.4999  
## 
## Coefficients:
##               Estimate Std. Error z value Pr(>|z|)    
## (Intercept)    -3.1340     0.4927  -6.361 2.00e-10 ***
## activitiesyes  -0.5145     0.2532  -2.032  0.04215 *  
## addressU       -0.7487     0.2995  -2.500  0.01242 *  
## goout           0.8028     0.1212   6.622 3.55e-11 ***
## sexM            0.9398     0.2536   3.706  0.00021 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 465.68  on 381  degrees of freedom
## Residual deviance: 392.97  on 377  degrees of freedom
## AIC: 402.97
## 
## Number of Fisher Scoring iterations: 4
OR <- coef(m) %>% exp %>% round(2)
CI <- confint(m) %>% exp %>% round(2)
## Waiting for profiling to be done...
cbind(OR, CI)
##                 OR 2.5 % 97.5 %
## (Intercept)   0.04  0.02   0.11
## activitiesyes 0.60  0.36   0.98
## addressU      0.47  0.26   0.85
## goout         2.23  1.77   2.85
## sexM          2.56  1.57   4.24

All explanatory variables were statistically significantly associated with high alcohol use (p<0.05, and none of the 95% confidence intervals overlapped 1). The odds of high use among students who participated in extracurricular activities were 0.60-fold compared with students who did not participate. In other words, the probability of high use was lower among students who participated in extracurricular activities. Similarly, urban residence was associated with lower odds of high alcohol use and male sex with higher odds of high alcohol use. The frequency of going out with friends was a numeric variable in our data: a one-unit increase in the frequency of going out was associated with 2.23-fold odds of high alcohol use.
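The log-odds coefficients can also be turned into a predicted probability for a given profile. A sketch using the coefficients reported above for a hypothetical student (male, rural address, no extracurricular activities, goout = 5):

```r
# log-odds for the profile: intercept + 5 * goout effect + male effect
logit <- -3.1340 + 5 * 0.8028 + 0.9398
prob  <- 1 / (1 + exp(-logit))  # inverse logit turns log-odds into probability
round(prob, 2)
## [1] 0.86
```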

Overall, my hypotheses related to extracurricular activities, going out with friends and sex were supported by the logistic regression analysis. On the other hand, I hypothesized that urban residence would be associated with a higher probability of high alcohol use, while in the analysis urban residence was in fact associated with clearly lower odds of high use.

Predictive power

Next, I will explore the predictive power of my logistic model where all explanatory variables were clearly associated with the odds of high alcohol use. I begin by creating a 2x2 table, which shows the numbers of correct and false predictions.

# predict() the probability of high_use
probabilities <- predict(m, type = "response")

# add the predicted probabilities to 'alc5'
alc5 <- mutate(alc5, probability = probabilities)

# use the probabilities to make a prediction of high_use
alc5 <- mutate(alc5, prediction = probability > 0.5)

# tabulate the target variable versus the predictions
table(high_use = alc5$high_use, prediction = alc5$prediction)
##         prediction
## high_use FALSE TRUE
##    FALSE   245   23
##    TRUE     71   43

We can see that for 245 cases the model correctly predicted that they do not use high amounts of alcohol and for 43 cases that they do have high alcohol use. On the other hand, 71 cases were incorrectly classified as not-high-users while in reality they were high-users.
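From this table the overall classification accuracy follows directly; a quick check using the counts above:

```r
# confusion matrix from above: rows = actual, columns = predicted
tab <- matrix(c(245, 71, 23, 43), nrow = 2,
              dimnames = list(high_use = c("FALSE", "TRUE"),
                              prediction = c("FALSE", "TRUE")))
accuracy <- sum(diag(tab)) / sum(tab)  # correct predictions / all cases
round(accuracy, 3)
## [1] 0.754
```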

I continue by (1) creating a graph of the actual cases and predictions of high use, (2) creating a probability table of the predictions and finally (3) defining a function that calculates the proportion of incorrect predictions.

# initialize a plot of 'high_use' versus 'probability' in 'alc5'
g <- ggplot(alc5, aes(x = probability, y = high_use, col = prediction))

# define the geom as points and draw the plot
g + geom_point()

# tabulate the target variable versus the predictions
table(high_use = alc5$high_use, prediction = alc5$prediction) %>% prop.table() %>% addmargins() %>% round(2)
##         prediction
## high_use FALSE TRUE  Sum
##    FALSE  0.64 0.06 0.70
##    TRUE   0.19 0.11 0.30
##    Sum    0.83 0.17 1.00
# define a loss function (mean prediction error)
loss_func <- function(class, prob) {
  n_wrong <- abs(class - prob) > 0.5
  mean(n_wrong)
}

# call loss_func to compute the average number of wrong predictions in the (training) data
loss_func(class = alc5$high_use, prob = alc5$probability)
## [1] 0.2460733

Looking at the graph, we can see that quite a few cases were predicted to have a low probability of high alcohol use but were still high users. Around 19% of predictions were incorrect in that way, while only 6% of cases were incorrectly predicted to have high use when in reality they did not. Overall, around one fourth of the cases were incorrectly classified with respect to their alcohol use. This result is definitely better than tossing a coin, and actually quite good given that we only had four explanatory variables in the model. At the same time, one fourth is still a fairly large proportion of false predictions, which could probably be reduced by including more variables.

Cross-validation

Using the above-defined loss function, I can perform a ten-fold cross-validation on my model. In this cross-validation, the analysis data is first divided into ten equal-sized parts. Each part in turn acts as the testing data while the other nine parts form the training data. The training data is used for fitting the model and the testing data for testing it. Based on the ten tests, we can calculate the average proportion of wrong predictions.

# K-fold cross-validation
library(boot)
cv <- cv.glm(data = alc5, cost = loss_func, glmfit = m, K = 10)

# average number of wrong predictions in the cross validation
cv$delta[1]
## [1] 0.2460733

In the ten-fold cross-validation the prediction error was around one fourth, i.e. about the same as the prediction error of our original full-sample model. My prediction error was also about the same as in the DataCamp exercise, although I mostly used a different set of explanatory variables.
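The fold mechanics described above can also be written out by hand; a sketch on simulated data (not the course data):

```r
# manual ten-fold cross-validation: each fold serves once as test data
set.seed(4)
d <- data.frame(x = rnorm(100))
d$y <- rbinom(100, size = 1, prob = plogis(d$x))
K <- 10
fold <- sample(rep(1:K, length.out = nrow(d)))  # random fold assignment
err <- sapply(1:K, function(k) {
  fit <- glm(y ~ x, data = d[fold != k, ], family = "binomial")
  p   <- predict(fit, newdata = d[fold == k, ], type = "response")
  mean((p > 0.5) != d$y[fold == k])             # fold-specific error rate
})
mean(err)  # average prediction error over the ten folds
```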


Chapter 4: Clustering and classification



Reading the data

I start by loading the Boston Housing Values data set used in the analysis. I will also explore its structure and dimensions.

library(MASS)
## 
## Attaching package: 'MASS'
## The following object is masked from 'package:dplyr':
## 
##     select
data("Boston")

str(Boston)
## 'data.frame':    506 obs. of  14 variables:
##  $ crim   : num  0.00632 0.02731 0.02729 0.03237 0.06905 ...
##  $ zn     : num  18 0 0 0 0 0 12.5 12.5 12.5 12.5 ...
##  $ indus  : num  2.31 7.07 7.07 2.18 2.18 2.18 7.87 7.87 7.87 7.87 ...
##  $ chas   : int  0 0 0 0 0 0 0 0 0 0 ...
##  $ nox    : num  0.538 0.469 0.469 0.458 0.458 0.458 0.524 0.524 0.524 0.524 ...
##  $ rm     : num  6.58 6.42 7.18 7 7.15 ...
##  $ age    : num  65.2 78.9 61.1 45.8 54.2 58.7 66.6 96.1 100 85.9 ...
##  $ dis    : num  4.09 4.97 4.97 6.06 6.06 ...
##  $ rad    : int  1 2 2 3 3 3 5 5 5 5 ...
##  $ tax    : num  296 242 242 222 222 222 311 311 311 311 ...
##  $ ptratio: num  15.3 17.8 17.8 18.7 18.7 18.7 15.2 15.2 15.2 15.2 ...
##  $ black  : num  397 397 393 395 397 ...
##  $ lstat  : num  4.98 9.14 4.03 2.94 5.33 ...
##  $ medv   : num  24 21.6 34.7 33.4 36.2 28.7 22.9 27.1 16.5 18.9 ...
dim(Boston)
## [1] 506  14

This data set contains 14 variables and 506 observations. The observations are suburbs within the Boston region and the variables measure different attributes of these suburbs. I continue by looking at the distributions and correlations of the variables.

summary(Boston)
##       crim                zn             indus            chas        
##  Min.   : 0.00632   Min.   :  0.00   Min.   : 0.46   Min.   :0.00000  
##  1st Qu.: 0.08204   1st Qu.:  0.00   1st Qu.: 5.19   1st Qu.:0.00000  
##  Median : 0.25651   Median :  0.00   Median : 9.69   Median :0.00000  
##  Mean   : 3.61352   Mean   : 11.36   Mean   :11.14   Mean   :0.06917  
##  3rd Qu.: 3.67708   3rd Qu.: 12.50   3rd Qu.:18.10   3rd Qu.:0.00000  
##  Max.   :88.97620   Max.   :100.00   Max.   :27.74   Max.   :1.00000  
##       nox               rm             age              dis        
##  Min.   :0.3850   Min.   :3.561   Min.   :  2.90   Min.   : 1.130  
##  1st Qu.:0.4490   1st Qu.:5.886   1st Qu.: 45.02   1st Qu.: 2.100  
##  Median :0.5380   Median :6.208   Median : 77.50   Median : 3.207  
##  Mean   :0.5547   Mean   :6.285   Mean   : 68.57   Mean   : 3.795  
##  3rd Qu.:0.6240   3rd Qu.:6.623   3rd Qu.: 94.08   3rd Qu.: 5.188  
##  Max.   :0.8710   Max.   :8.780   Max.   :100.00   Max.   :12.127  
##       rad              tax           ptratio          black       
##  Min.   : 1.000   Min.   :187.0   Min.   :12.60   Min.   :  0.32  
##  1st Qu.: 4.000   1st Qu.:279.0   1st Qu.:17.40   1st Qu.:375.38  
##  Median : 5.000   Median :330.0   Median :19.05   Median :391.44  
##  Mean   : 9.549   Mean   :408.2   Mean   :18.46   Mean   :356.67  
##  3rd Qu.:24.000   3rd Qu.:666.0   3rd Qu.:20.20   3rd Qu.:396.23  
##  Max.   :24.000   Max.   :711.0   Max.   :22.00   Max.   :396.90  
##      lstat            medv      
##  Min.   : 1.73   Min.   : 5.00  
##  1st Qu.: 6.95   1st Qu.:17.02  
##  Median :11.36   Median :21.20  
##  Mean   :12.65   Mean   :22.53  
##  3rd Qu.:16.95   3rd Qu.:25.00  
##  Max.   :37.97   Max.   :50.00
library(tidyr)
library(corrplot)
## corrplot 0.84 loaded
# calculate the correlation matrix and round it
cor_matrix<-cor(Boston) %>% round(2) 

# print the correlation matrix
cor_matrix
##          crim    zn indus  chas   nox    rm   age   dis   rad   tax
## crim     1.00 -0.20  0.41 -0.06  0.42 -0.22  0.35 -0.38  0.63  0.58
## zn      -0.20  1.00 -0.53 -0.04 -0.52  0.31 -0.57  0.66 -0.31 -0.31
## indus    0.41 -0.53  1.00  0.06  0.76 -0.39  0.64 -0.71  0.60  0.72
## chas    -0.06 -0.04  0.06  1.00  0.09  0.09  0.09 -0.10 -0.01 -0.04
## nox      0.42 -0.52  0.76  0.09  1.00 -0.30  0.73 -0.77  0.61  0.67
## rm      -0.22  0.31 -0.39  0.09 -0.30  1.00 -0.24  0.21 -0.21 -0.29
## age      0.35 -0.57  0.64  0.09  0.73 -0.24  1.00 -0.75  0.46  0.51
## dis     -0.38  0.66 -0.71 -0.10 -0.77  0.21 -0.75  1.00 -0.49 -0.53
## rad      0.63 -0.31  0.60 -0.01  0.61 -0.21  0.46 -0.49  1.00  0.91
## tax      0.58 -0.31  0.72 -0.04  0.67 -0.29  0.51 -0.53  0.91  1.00
## ptratio  0.29 -0.39  0.38 -0.12  0.19 -0.36  0.26 -0.23  0.46  0.46
## black   -0.39  0.18 -0.36  0.05 -0.38  0.13 -0.27  0.29 -0.44 -0.44
## lstat    0.46 -0.41  0.60 -0.05  0.59 -0.61  0.60 -0.50  0.49  0.54
## medv    -0.39  0.36 -0.48  0.18 -0.43  0.70 -0.38  0.25 -0.38 -0.47
##         ptratio black lstat  medv
## crim       0.29 -0.39  0.46 -0.39
## zn        -0.39  0.18 -0.41  0.36
## indus      0.38 -0.36  0.60 -0.48
## chas      -0.12  0.05 -0.05  0.18
## nox        0.19 -0.38  0.59 -0.43
## rm        -0.36  0.13 -0.61  0.70
## age        0.26 -0.27  0.60 -0.38
## dis       -0.23  0.29 -0.50  0.25
## rad        0.46 -0.44  0.49 -0.38
## tax        0.46 -0.44  0.54 -0.47
## ptratio    1.00 -0.18  0.37 -0.51
## black     -0.18  1.00 -0.37  0.33
## lstat      0.37 -0.37  1.00 -0.74
## medv      -0.51  0.33 -0.74  1.00
# visualize the correlation matrix
corrplot(cor_matrix, method="circle", type = "upper", cl.pos = "b", tl.pos = "d", tl.cex = 0.6)

By looking at the distributions, we can see that some suburbs clearly diverge from the general pattern. For instance, in one suburb the per capita crime rate (crim) is 89 while the median in Boston is only 0.26. A similar pattern can be seen for the proportion of residential land zoned for lots over 25,000 sq.ft. (zn), the proportion of blacks (black) and the median value of owner-occupied homes (medv).

The correlation plot indicates which variables have the strongest positive (blue) and negative (red) correlations. Some of the strongest positive correlations are found between the full-value property tax rate (tax) and accessibility to radial highways (rad), as well as between nitrogen oxide concentration (nox) and the proportion of non-retail business acres (indus). Distance to five employment centres (dis) in turn has strong negative correlations with nitrogen oxides and non-retail businesses. On the other hand, the Charles River dummy variable (chas) shows hardly any correlation with the other variables.
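The strongest pairwise correlations read off the plot can be confirmed programmatically; a sketch on the same Boston data:

```r
library(MASS)

cm <- cor(Boston)
cm[upper.tri(cm, diag = TRUE)] <- NA  # keep each variable pair only once
ord <- order(abs(cm), decreasing = TRUE, na.last = NA)
top <- arrayInd(ord[1:3], dim(cm))    # the three strongest correlations
data.frame(var1 = rownames(cm)[top[, 1]],
           var2 = colnames(cm)[top[, 2]],
           r    = round(cm[ord[1:3]], 2))
```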

Modifications

The variables need to be centered and standardized, i.e. scaled, for the classification and clustering analyses.

# center and standardize variables
boston_scaled <- scale(Boston)

# change the object to data frame
boston_scaled <- as.data.frame(boston_scaled)

# summaries of the scaled variables
summary(boston_scaled)
##       crim                 zn               indus        
##  Min.   :-0.419367   Min.   :-0.48724   Min.   :-1.5563  
##  1st Qu.:-0.410563   1st Qu.:-0.48724   1st Qu.:-0.8668  
##  Median :-0.390280   Median :-0.48724   Median :-0.2109  
##  Mean   : 0.000000   Mean   : 0.00000   Mean   : 0.0000  
##  3rd Qu.: 0.007389   3rd Qu.: 0.04872   3rd Qu.: 1.0150  
##  Max.   : 9.924110   Max.   : 3.80047   Max.   : 2.4202  
##       chas              nox                rm               age         
##  Min.   :-0.2723   Min.   :-1.4644   Min.   :-3.8764   Min.   :-2.3331  
##  1st Qu.:-0.2723   1st Qu.:-0.9121   1st Qu.:-0.5681   1st Qu.:-0.8366  
##  Median :-0.2723   Median :-0.1441   Median :-0.1084   Median : 0.3171  
##  Mean   : 0.0000   Mean   : 0.0000   Mean   : 0.0000   Mean   : 0.0000  
##  3rd Qu.:-0.2723   3rd Qu.: 0.5981   3rd Qu.: 0.4823   3rd Qu.: 0.9059  
##  Max.   : 3.6648   Max.   : 2.7296   Max.   : 3.5515   Max.   : 1.1164  
##       dis               rad               tax             ptratio       
##  Min.   :-1.2658   Min.   :-0.9819   Min.   :-1.3127   Min.   :-2.7047  
##  1st Qu.:-0.8049   1st Qu.:-0.6373   1st Qu.:-0.7668   1st Qu.:-0.4876  
##  Median :-0.2790   Median :-0.5225   Median :-0.4642   Median : 0.2746  
##  Mean   : 0.0000   Mean   : 0.0000   Mean   : 0.0000   Mean   : 0.0000  
##  3rd Qu.: 0.6617   3rd Qu.: 1.6596   3rd Qu.: 1.5294   3rd Qu.: 0.8058  
##  Max.   : 3.9566   Max.   : 1.6596   Max.   : 1.7964   Max.   : 1.6372  
##      black             lstat              medv        
##  Min.   :-3.9033   Min.   :-1.5296   Min.   :-1.9063  
##  1st Qu.: 0.2049   1st Qu.:-0.7986   1st Qu.:-0.5989  
##  Median : 0.3808   Median :-0.1811   Median :-0.1449  
##  Mean   : 0.0000   Mean   : 0.0000   Mean   : 0.0000  
##  3rd Qu.: 0.4332   3rd Qu.: 0.6024   3rd Qu.: 0.2683  
##  Max.   : 0.4406   Max.   : 3.5453   Max.   : 2.9865

We can see that all variables have mean zero. The outlier cases can still be clearly seen. Next, I will create a categorical variable indicating the quantiles of the (scaled) crime rate. This variable replaces the original continuous variable.
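The zero means (and unit variances) produced by scale() can be verified directly; a quick sketch:

```r
library(MASS)

bs <- as.data.frame(scale(Boston))
# every scaled column should have mean 0 and standard deviation 1
all(abs(colMeans(bs)) < 1e-10)
## [1] TRUE
all(abs(sapply(bs, sd) - 1) < 1e-10)
## [1] TRUE
```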

# create a quantile vector of crim and print it
bins <- quantile(boston_scaled$crim)
bins
##           0%          25%          50%          75%         100% 
## -0.419366929 -0.410563278 -0.390280295  0.007389247  9.924109610
# create a categorical variable 'crime'
crime <- cut(boston_scaled$crim, breaks = bins, include.lowest = TRUE, label = c("low", "med_low", "med_high", "high"))
summary(crime)
##      low  med_low med_high     high 
##      127      126      126      127
# remove original crim from the dataset
boston_scaled <- dplyr::select(boston_scaled, -crim)

# add the new categorical value to scaled data
boston_scaled <- data.frame(boston_scaled, crime)

Everything seems to be fine with the categorical variable, so I will continue by splitting the data set into train (80%) and test (20%) sets. The model is trained with the train set and predictions on new data are made with the test set.

# number of rows in the Boston dataset 
n <- nrow(boston_scaled)

# choose randomly 80% of the rows
ind <- sample(n,  size = n * 0.8)

# create train set
train <- boston_scaled[ind,]

# create test set 
test <- boston_scaled[-ind,]

Linear discriminant analysis

Now I am ready to fit a linear discriminant analysis on the train set. This analysis finds the linear combinations of the variables that best separate the classes of the target variable. The categorical crime rate will serve as the target variable and all the other variables in the data set as predictors.

# linear discriminant analysis
lda.fit <- lda(crime ~ ., data = train)

# print the lda.fit object
lda.fit
## Call:
## lda(crime ~ ., data = train)
## 
## Prior probabilities of groups:
##       low   med_low  med_high      high 
## 0.2475248 0.2475248 0.2450495 0.2599010 
## 
## Group means:
##                   zn      indus         chas        nox          rm
## low       0.93971091 -0.9037551 -0.154216061 -0.8524253  0.46388793
## med_low  -0.09512877 -0.2775628  0.003267949 -0.5394142 -0.15236518
## med_high -0.40754935  0.2325600  0.284432582  0.3928027  0.07661393
## high     -0.48724019  1.0149946 -0.084848104  1.0189743 -0.39189006
##                 age        dis        rad        tax     ptratio
## low      -0.8785045  0.8082368 -0.6924575 -0.7448635 -0.44413755
## med_low  -0.3193345  0.3101668 -0.5557899 -0.4856921 -0.09632279
## med_high  0.3931425 -0.3566504 -0.3844363 -0.2697891 -0.19734943
## high      0.7959843 -0.8414701  1.6596029  1.5294129  0.80577843
##               black       lstat         medv
## low       0.3787351 -0.76658957  0.555522002
## med_low   0.3492143 -0.13717334 -0.004763056
## med_high  0.1010690  0.06572706  0.156562428
## high     -0.8916147  0.90277143 -0.711553858
## 
## Coefficients of linear discriminants:
##                 LD1          LD2         LD3
## zn       0.10701570  0.780418497 -0.87187927
## indus    0.03161207 -0.262327832  0.31197954
## chas    -0.06312356 -0.113448997  0.09289584
## nox      0.26133506 -0.697504226 -1.49379684
## rm      -0.06669223 -0.121530860 -0.19685362
## age      0.31164820 -0.277613877 -0.08468761
## dis     -0.06587278 -0.405499060  0.08158625
## rad      3.19136586  1.096457917  0.03493189
## tax     -0.00604261 -0.238351641  0.54822209
## ptratio  0.15667995 -0.052300407 -0.40366861
## black   -0.18810129  0.006945473  0.16289532
## lstat    0.19971318 -0.407760364  0.17451790
## medv     0.17617835 -0.494365901 -0.39829530
## 
## Proportion of trace:
##    LD1    LD2    LD3 
## 0.9513 0.0370 0.0117
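The proportion of trace printed above is each linear discriminant's share of the between-group variance, and it can be recomputed from the singular values stored in the fitted object. A minimal sketch of that calculation, using the built-in iris data as a stand-in so the example runs on its own:

```r
library(MASS)

# the proportion of trace equals the squared singular values of the
# discriminants, normalised to sum to one (iris used as a stand-in data set)
fit <- lda(Species ~ ., data = iris)
prop_trace <- fit$svd^2 / sum(fit$svd^2)
round(prop_trace, 4)
```

Here LD1 carries the bulk of the between-group variance, just as it does in the Boston analysis above (95.13%).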
# the function for lda biplot arrows
lda.arrows <- function(x, myscale = 1, arrow_heads = 0.1, color = "red", tex = 0.75, choices = c(1,2)){
  heads <- coef(x)
  arrows(x0 = 0, y0 = 0, 
         x1 = myscale * heads[,choices[1]], 
         y1 = myscale * heads[,choices[2]], col=color, length = arrow_heads)
  text(myscale * heads[,choices], labels = row.names(heads), 
       cex = tex, col=color, pos=3)
}

# target classes as numeric
classes <- as.numeric(train$crime)

# plot the lda results
plot(lda.fit, dimen = 2, col = classes, pch = classes)
lda.arrows(lda.fit, myscale = 2)

The plot reveals that the high crime rate suburbs are the most distinctive with regard to our selection of predictor variables. Accessibility to radial highways is the most distinctive feature of the high crime rate areas. The med_high and med_low classes overlap the most.

In order to examine how well our model performs in predicting crime rate, I save the correct classes of the test data as a separate object.

library(dplyr)
# save the correct classes from test data
correct_classes <- test$crime

# remove the crime variable from test data
test <- dplyr::select(test, -crime)

Now I can predict the classes with the LDA model on the test data and compare the predicted classes with the actual categories.

# predict classes with test data
lda.pred <- predict(lda.fit, newdata = test)

# cross tabulate the results
table(correct = correct_classes, predicted = lda.pred$class)
##           predicted
## correct    low med_low med_high high
##   low       17       7        3    0
##   med_low    5      17        4    0
##   med_high   2       7       18    0
##   high       0       0        1   21

All suburbs with a high crime rate were correctly predicted. This is not a surprise, given how distinctive the high crime rate suburbs turned out to be in the original analysis. The model had more difficulty separating med_high suburbs from med_low. Low and med_low were somewhat overlapping in the original analysis, and in this test too some of the low crime suburbs were classified as med_low. Despite this, most predictions were correct.
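The overall accuracy can be read off the cross tabulation as the share of observations on its diagonal. A minimal sketch, with the table above re-entered by hand (the exact counts will vary between runs because the train/test split is random):

```r
# confusion matrix from above: rows = correct class, columns = predicted class
tab <- matrix(c(17,  7,  3,  0,
                 5, 17,  4,  0,
                 2,  7, 18,  0,
                 0,  0,  1, 21),
              nrow = 4, byrow = TRUE,
              dimnames = list(correct   = c("low", "med_low", "med_high", "high"),
                              predicted = c("low", "med_low", "med_high", "high")))

# accuracy = correctly classified observations / all observations
accuracy <- sum(diag(tab)) / sum(tab)
round(accuracy, 3)  # 0.716
```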

K-means clustering

Next, I will perform a K-means clustering analysis on the Boston data set. I will first reload and scale the data and calculate the Euclidean distances between the observations.

data("Boston")

# center and standardize variables
boston_scaled <- scale(Boston)

# change the object to data frame
boston_scaled <- as.data.frame(boston_scaled)

# euclidean distance matrix
dist_eu <- dist(Boston)

# look at the summary of the distances
summary(dist_eu)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   1.119  85.624 170.539 226.315 371.950 626.047
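Euclidean distance is sensitive to the measurement scales of the variables, which is why standardising matters (note that the distance matrix above was computed on the unscaled Boston data, so large-scale variables such as tax dominate it). A toy illustration with made-up numbers, not the Boston data:

```r
# two observations with one large-scale and one small-scale variable
x <- data.frame(tax = c(300, 600),   # large range dominates the distance
                nox = c(0.4, 0.9))   # small range barely contributes

dist(x)         # essentially just the tax difference, about 300
dist(scale(x))  # after scaling both variables contribute equally
```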

I begin the K-means clustering analysis with four clusters and visualise some of the columns of the data.

# k-means clustering
km <- kmeans(Boston, centers = 4)

# plot the Boston dataset with clusters
pairs(Boston[5:10], col = km$cluster)

Four clusters could be too many, as it is difficult to distinguish them in the plot. It is possible to identify the optimal number of clusters by looking at how the total within cluster sum of squares (WCSS) behaves when the number of clusters changes.

library(ggplot2)
set.seed(123)

# determine the number of clusters
k_max <- 10

# calculate the total within sum of squares
twcss <- sapply(1:k_max, function(k){kmeans(Boston, k)$tot.withinss})

# visualize the results
qplot(x = 1:k_max, y = twcss, geom = 'line')

It seems that there is a clear drop in WCSS already after the first cluster. Since the WCSS still drops clearly at the second cluster, I will estimate a two-cluster solution.

# k-means clustering
km <- kmeans(Boston, centers = 2)

# plot the Boston dataset with clusters
pairs(Boston[1:7], col = km$cluster)

pairs(Boston[8:14], col = km$cluster)

The two-cluster solution is easier to interpret than the original four-cluster solution. Accessibility to radial highways is once again one of the variables that define the clusters, and the full-value property-tax rate is another. These variables probably identify the central areas of Boston. Crime rate is also one of the attributes that separate the clusters.
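One way to check which variables separate the two clusters is to compare the cluster means with aggregate. A sketch, assuming the MASS package (which provides the Boston data) is available; kmeans starts from random centers, so set.seed makes the result reproducible:

```r
library(MASS)
set.seed(123)

data("Boston")
km <- kmeans(Boston, centers = 2)

# mean of each variable within each cluster, on the original scale
centers <- aggregate(Boston, by = list(cluster = km$cluster), FUN = mean)
round(centers[, c("cluster", "crim", "rad", "tax")], 1)
```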


Chapter 5: Dimensionality reduction techniques



Reading and exploring the data

Our data set originates from the United Nations Development Programme. It is a combination of two separate data sets measuring human development and gender (in)equality for the same countries. I have modified the data so that it only includes 8 indicators of human development / equality, and only for those countries that did not have any missing values for these indicators. I start by reading the data and examining its structure and dimensions.

setwd("~/Documents/GitHub/IODS-project/data")
human <- read.csv(file = "human.csv", row.names = 1)
str(human)
## 'data.frame':    155 obs. of  8 variables:
##  $ f2edu        : num  97.4 94.3 95 95.5 87.7 96.3 80.5 95.1 100 95 ...
##  $ flabour      : num  61.2 58.8 61.8 58.7 58.5 53.6 53.1 56.3 61.6 62 ...
##  $ expedu       : num  17.5 20.2 15.8 18.7 17.9 16.5 18.6 16.5 15.9 19.2 ...
##  $ le           : num  81.6 82.4 83 80.2 81.6 80.9 80.9 79.1 82 81.8 ...
##  $ gni          : int  64992 42261 56431 44025 45435 43919 39568 52947 42155 32689 ...
##  $ matmort      : int  4 6 6 5 6 7 9 28 11 8 ...
##  $ adolbirthrate: num  7.8 12.1 1.9 5.1 6.2 3.8 8.2 31 14.5 25.3 ...
##  $ parliament   : num  39.6 30.5 28.5 38 36.9 36.9 19.9 19.4 28.2 31.4 ...
dim(human)
## [1] 155   8

We can see that there are 155 observations, i.e. countries, in the data. The 8 variables measure

  • % of women with at least secondary education (f2edu)
  • % of women in labour force (flabour)
  • Expected years of schooling (expedu)
  • Life expectancy at birth (le)
  • GNI (gross national income) per capita (gni)
  • Maternal mortality ratio (matmort)
  • Adolescent birth rate (adolbirthrate)
  • Female share of parliamentary seats (parliament)

I continue by examining the distributions of the variables and their relationships with each other.

summary(human)
##      f2edu           flabour          expedu            le       
##  Min.   :  0.90   Min.   :13.50   Min.   : 5.40   Min.   :49.00  
##  1st Qu.: 27.15   1st Qu.:44.45   1st Qu.:11.25   1st Qu.:66.30  
##  Median : 56.60   Median :53.50   Median :13.50   Median :74.20  
##  Mean   : 55.37   Mean   :52.52   Mean   :13.18   Mean   :71.65  
##  3rd Qu.: 85.15   3rd Qu.:61.90   3rd Qu.:15.20   3rd Qu.:77.25  
##  Max.   :100.00   Max.   :88.10   Max.   :20.20   Max.   :83.50  
##       gni            matmort       adolbirthrate      parliament   
##  Min.   :   581   Min.   :   1.0   Min.   :  0.60   Min.   : 0.00  
##  1st Qu.:  4198   1st Qu.:  11.5   1st Qu.: 12.65   1st Qu.:12.40  
##  Median : 12040   Median :  49.0   Median : 33.60   Median :19.30  
##  Mean   : 17628   Mean   : 149.1   Mean   : 47.16   Mean   :20.91  
##  3rd Qu.: 24512   3rd Qu.: 190.0   3rd Qu.: 71.95   3rd Qu.:27.95  
##  Max.   :123124   Max.   :1100.0   Max.   :204.80   Max.   :57.50
library(GGally)
ggpairs(human)

If we look at the summary table, we can see that all variables vary strongly between the countries. For instance, the proportion of women with at least secondary education ranges from 0.9% to 100%, and female parliamentary representation from 0% to 57.5%. Some of the variables have quite strong correlations with each other. For example, life expectancy is strongly associated with higher expected years of education and lower adolescent birth rate. On the other hand, female parliamentary representation does not correlate strongly with any of the other variables. Expected years of education is the only variable with a somewhat normal distribution. For maternal mortality, there are a few countries with very high ratios, but most countries fall at the lower end of the distribution.

Principal component analysis (PCA)

Next, I will perform a principal component analysis on the data described above. I plot the results with a biplot that shows the two most important principal components on the x- and y-axes.

#The model
pca_human <- prcomp(human)
summary(pca_human)
## Importance of components:
##                              PC1      PC2   PC3   PC4   PC5   PC6   PC7
## Standard deviation     1.854e+04 186.2835 25.97 20.07 14.32 10.63 3.721
## Proportion of Variance 9.999e-01   0.0001  0.00  0.00  0.00  0.00 0.000
## Cumulative Proportion  9.999e-01   1.0000  1.00  1.00  1.00  1.00 1.000
##                          PC8
## Standard deviation     1.428
## Proportion of Variance 0.000
## Cumulative Proportion  1.000
biplot(pca_human, choices = 1:2, cex = c(0.5, 0.8), col = c("grey40", "deeppink2"))
## Warning in arrows(0, 0, y[, 1L] * 0.8, y[, 2L] * 0.8, col = col[2L], length
## = arrow.len): zero-length arrow is of indeterminate angle and so skipped

## Warning in arrows(0, 0, y[, 1L] * 0.8, y[, 2L] * 0.8, col = col[2L], length
## = arrow.len): zero-length arrow is of indeterminate angle and so skipped

## Warning in arrows(0, 0, y[, 1L] * 0.8, y[, 2L] * 0.8, col = col[2L], length
## = arrow.len): zero-length arrow is of indeterminate angle and so skipped

## Warning in arrows(0, 0, y[, 1L] * 0.8, y[, 2L] * 0.8, col = col[2L], length
## = arrow.len): zero-length arrow is of indeterminate angle and so skipped

The plot is quite difficult to read, as most countries fall into its top-right corner. Because we have not standardized the variables, gross national income (GNI) dominates the analysis due to its large variance: the first principal component accounts for practically 100% of the variance. GNI defines the first principal component, but it is difficult to say what defines the second, as the other arrows are not drawn. Let’s rerun the analysis with scaled variables.
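The effect of leaving the variables unscaled can be demonstrated on toy data: when one column has a much larger variance than the others, it alone defines the first principal component. A small sketch with made-up numbers mimicking gni and le:

```r
set.seed(1)

# gni varies on a scale of thousands, le on a scale of years
d <- data.frame(gni = rnorm(100, mean = 20000, sd = 15000),
                le  = rnorm(100, mean = 70,    sd = 8))

# proportion of variance on PC1: unscaled vs scaled
summary(prcomp(d))$importance[2, 1]                 # practically 1
summary(prcomp(d, scale. = TRUE))$importance[2, 1]  # close to 0.5
```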

#Scale
human_std <- scale(human)

#Estimation
pca_human <- prcomp(human_std)
s <- summary(pca_human)

# rounded percentages of variance captured by each PC
pca_pr <- round(100*s$importance[2, ], digits = 1)

# create object pc_lab to be used as axis labels
pc_lab <- paste0(c("Underdevelopment", "Gender equality"), " (", pca_pr, "%)")

biplot(pca_human, choices = 1:2, cex = c(0.5, 0.8), col = c("grey40", "deeppink2"), xlab = pc_lab[1], ylab = pc_lab[2])

After scaling the variables, the plot becomes much clearer to interpret. The biplot arrows reveal that the first principal component measures human development or, more precisely, underdevelopment (high maternal mortality, low life expectancy, low education etc.), whereas the second component measures gender equality separately from human development (female parliamentary representation and female labour market participation). It is easy to see why: when we looked at the correlations, parliament and female labour force participation were the only variables that did not correlate strongly with the others. I have labeled the principal components in the plot accordingly, with the proportion of variance accounted for by each component shown in parentheses. The human development component accounted for more than half of the variance. Because negative values of the human development component indicate high development, the Nordic countries (highly developed, with high gender equality) can be found in the top-left corner of the plot. Interestingly, there are quite a few underdeveloped countries with high equality and vice versa.

Multiple correspondence analysis

Next, I move on to multiple correspondence analysis. Here I will use a data set that contains answers to a questionnaire on tea consumption. I start by loading the data and exploring its structure and dimensions.

library(FactoMineR)
data("tea")
str(tea)
## 'data.frame':    300 obs. of  36 variables:
##  $ breakfast       : Factor w/ 2 levels "breakfast","Not.breakfast": 1 1 2 2 1 2 1 2 1 1 ...
##  $ tea.time        : Factor w/ 2 levels "Not.tea time",..: 1 1 2 1 1 1 2 2 2 1 ...
##  $ evening         : Factor w/ 2 levels "evening","Not.evening": 2 2 1 2 1 2 2 1 2 1 ...
##  $ lunch           : Factor w/ 2 levels "lunch","Not.lunch": 2 2 2 2 2 2 2 2 2 2 ...
##  $ dinner          : Factor w/ 2 levels "dinner","Not.dinner": 2 2 1 1 2 1 2 2 2 2 ...
##  $ always          : Factor w/ 2 levels "always","Not.always": 2 2 2 2 1 2 2 2 2 2 ...
##  $ home            : Factor w/ 2 levels "home","Not.home": 1 1 1 1 1 1 1 1 1 1 ...
##  $ work            : Factor w/ 2 levels "Not.work","work": 1 1 2 1 1 1 1 1 1 1 ...
##  $ tearoom         : Factor w/ 2 levels "Not.tearoom",..: 1 1 1 1 1 1 1 1 1 2 ...
##  $ friends         : Factor w/ 2 levels "friends","Not.friends": 2 2 1 2 2 2 1 2 2 2 ...
##  $ resto           : Factor w/ 2 levels "Not.resto","resto": 1 1 2 1 1 1 1 1 1 1 ...
##  $ pub             : Factor w/ 2 levels "Not.pub","pub": 1 1 1 1 1 1 1 1 1 1 ...
##  $ Tea             : Factor w/ 3 levels "black","Earl Grey",..: 1 1 2 2 2 2 2 1 2 1 ...
##  $ How             : Factor w/ 4 levels "alone","lemon",..: 1 3 1 1 1 1 1 3 3 1 ...
##  $ sugar           : Factor w/ 2 levels "No.sugar","sugar": 2 1 1 2 1 1 1 1 1 1 ...
##  $ how             : Factor w/ 3 levels "tea bag","tea bag+unpackaged",..: 1 1 1 1 1 1 1 1 2 2 ...
##  $ where           : Factor w/ 3 levels "chain store",..: 1 1 1 1 1 1 1 1 2 2 ...
##  $ price           : Factor w/ 6 levels "p_branded","p_cheap",..: 4 6 6 6 6 3 6 6 5 5 ...
##  $ age             : int  39 45 47 23 48 21 37 36 40 37 ...
##  $ sex             : Factor w/ 2 levels "F","M": 2 1 1 2 2 2 2 1 2 2 ...
##  $ SPC             : Factor w/ 7 levels "employee","middle",..: 2 2 4 6 1 6 5 2 5 5 ...
##  $ Sport           : Factor w/ 2 levels "Not.sportsman",..: 2 2 2 1 2 2 2 2 2 1 ...
##  $ age_Q           : Factor w/ 5 levels "15-24","25-34",..: 3 4 4 1 4 1 3 3 3 3 ...
##  $ frequency       : Factor w/ 4 levels "1/day","1 to 2/week",..: 1 1 3 1 3 1 4 2 3 3 ...
##  $ escape.exoticism: Factor w/ 2 levels "escape-exoticism",..: 2 1 2 1 1 2 2 2 2 2 ...
##  $ spirituality    : Factor w/ 2 levels "Not.spirituality",..: 1 1 1 2 2 1 1 1 1 1 ...
##  $ healthy         : Factor w/ 2 levels "healthy","Not.healthy": 1 1 1 1 2 1 1 1 2 1 ...
##  $ diuretic        : Factor w/ 2 levels "diuretic","Not.diuretic": 2 1 1 2 1 2 2 2 2 1 ...
##  $ friendliness    : Factor w/ 2 levels "friendliness",..: 2 2 1 2 1 2 2 1 2 1 ...
##  $ iron.absorption : Factor w/ 2 levels "iron absorption",..: 2 2 2 2 2 2 2 2 2 2 ...
##  $ feminine        : Factor w/ 2 levels "feminine","Not.feminine": 2 2 2 2 2 2 2 1 2 2 ...
##  $ sophisticated   : Factor w/ 2 levels "Not.sophisticated",..: 1 1 1 2 1 1 1 2 2 1 ...
##  $ slimming        : Factor w/ 2 levels "No.slimming",..: 1 1 1 1 1 1 1 1 1 1 ...
##  $ exciting        : Factor w/ 2 levels "exciting","No.exciting": 2 1 2 2 2 2 2 2 2 2 ...
##  $ relaxing        : Factor w/ 2 levels "No.relaxing",..: 1 1 2 2 2 2 2 2 2 2 ...
##  $ effect.on.health: Factor w/ 2 levels "effect on health",..: 2 2 2 2 2 2 2 2 2 2 ...
dim(tea)
## [1] 300  36

This tea consumption data set has 300 observations (survey participants) and 36 variables (questions). The variables include questions on background factors (sex, age etc.), tea consumption habits (frequency, during lunch, at work etc.), and tea-related notions and attitudes (healthiness, femininity, relaxation).

To improve the clarity of the analysis, I will select only a subset of ten variables.

# column names to keep in the dataset
keep_columns <- c("Tea", "sex", "where", "SPC", "friends", "healthy", "sophisticated", "frequency", "breakfast", "lunch")

# select the 'keep_columns' to create a new dataset
tea_time <- dplyr::select(tea, one_of(keep_columns))

Next, I will perform a multiple correspondence analysis on these ten variables.

# multiple correspondence analysis
mca <- MCA(tea_time, graph = FALSE)

# summary of the model
summary(mca)
## 
## Call:
## MCA(X = tea_time, graph = FALSE) 
## 
## 
## Eigenvalues
##                        Dim.1   Dim.2   Dim.3   Dim.4   Dim.5   Dim.6
## Variance               0.174   0.166   0.144   0.138   0.122   0.118
## % of var.              9.169   8.726   7.557   7.283   6.395   6.237
## Cumulative % of var.   9.169  17.895  25.452  32.735  39.130  45.367
##                        Dim.7   Dim.8   Dim.9  Dim.10  Dim.11  Dim.12
## Variance               0.113   0.107   0.105   0.093   0.089   0.087
## % of var.              5.945   5.658   5.551   4.919   4.699   4.604
## Cumulative % of var.  51.312  56.970  62.520  67.439  72.138  76.742
##                       Dim.13  Dim.14  Dim.15  Dim.16  Dim.17  Dim.18
## Variance               0.075   0.073   0.070   0.065   0.059   0.054
## % of var.              3.927   3.855   3.666   3.432   3.105   2.820
## Cumulative % of var.  80.669  84.524  88.190  91.622  94.727  97.547
##                       Dim.19
## Variance               0.047
## % of var.              2.453
## Cumulative % of var. 100.000
## 
## Individuals (the 10 first)
##                         Dim.1    ctr   cos2    Dim.2    ctr   cos2  
## 1                    |  0.295  0.167  0.044 |  0.882  1.563  0.392 |
## 2                    |  0.096  0.017  0.005 |  0.732  1.078  0.281 |
## 3                    | -0.058  0.007  0.002 | -0.148  0.044  0.010 |
## 4                    |  0.416  0.330  0.146 | -0.076  0.011  0.005 |
## 5                    |  0.003  0.000  0.000 |  0.325  0.213  0.066 |
## 6                    |  0.320  0.196  0.074 |  0.045  0.004  0.001 |
## 7                    | -0.105  0.021  0.005 |  0.186  0.069  0.015 |
## 8                    |  0.511  0.500  0.128 | -0.101  0.021  0.005 |
## 9                    | -0.179  0.062  0.016 |  0.402  0.324  0.082 |
## 10                   | -0.282  0.152  0.035 |  0.920  1.703  0.378 |
##                       Dim.3    ctr   cos2  
## 1                    -0.227  0.120  0.026 |
## 2                    -0.400  0.371  0.084 |
## 3                    -0.469  0.511  0.101 |
## 4                    -0.363  0.306  0.111 |
## 5                    -0.391  0.355  0.095 |
## 6                    -0.470  0.512  0.158 |
## 7                     0.370  0.318  0.060 |
## 8                    -0.080  0.015  0.003 |
## 9                     0.499  0.579  0.127 |
## 10                    0.353  0.289  0.056 |
## 
## Categories (the 10 first)
##                          Dim.1     ctr    cos2  v.test     Dim.2     ctr
## black                |   0.017   0.004   0.000   0.170 |   0.537   4.294
## Earl Grey            |  -0.186   1.281   0.063  -4.326 |  -0.278   3.004
## green                |   1.051   6.971   0.136   6.387 |   0.423   1.185
## F                    |  -0.339   3.912   0.168  -7.078 |  -0.247   2.191
## M                    |   0.494   5.707   0.168   7.078 |   0.361   3.197
## chain store          |   0.183   1.236   0.060   4.229 |  -0.138   0.739
## chain store+tea shop |  -0.817   9.950   0.234  -8.369 |   0.180   0.509
## tea shop             |   0.949   5.170   0.100   5.470 |   0.417   1.049
## employee             |   0.135   0.206   0.004   1.155 |   0.070   0.059
## middle               |   0.196   0.295   0.006   1.333 |   0.526   2.222
##                         cos2  v.test     Dim.3     ctr    cos2  v.test  
## black                  0.095   5.316 |   0.211   0.765   0.015   2.087 |
## Earl Grey              0.140  -6.462 |  -0.125   0.695   0.028  -2.892 |
## green                  0.022   2.569 |   0.255   0.499   0.008   1.552 |
## F                      0.089  -5.168 |  -0.266   2.924   0.103  -5.556 |
## M                      0.089   5.168 |   0.388   4.267   0.103   5.556 |
## chain store            0.034  -3.190 |  -0.461   9.490   0.378 -10.638 |
## chain store+tea shop   0.011   1.846 |   0.573   5.936   0.115   5.868 |
## tea shop               0.019   2.404 |   1.464  14.937   0.238   8.441 |
## employee               0.001   0.602 |  -0.797   8.709   0.156  -6.822 |
## middle                 0.043   3.565 |   0.426   1.688   0.028   2.892 |
## 
## Categorical variables (eta2)
##                        Dim.1 Dim.2 Dim.3  
## Tea                  | 0.144 0.141 0.028 |
## sex                  | 0.168 0.089 0.103 |
## where                | 0.285 0.038 0.436 |
## SPC                  | 0.150 0.270 0.444 |
## friends              | 0.119 0.162 0.038 |
## healthy              | 0.011 0.136 0.050 |
## sophisticated        | 0.032 0.049 0.033 |
## frequency            | 0.405 0.391 0.184 |
## breakfast            | 0.206 0.351 0.023 |
## lunch                | 0.222 0.030 0.097 |
# visualize MCA
plot(mca, invisible=c("var"), habillage = "quali")

plot(mca, invisible=c("ind"), habillage = "quali")

From the summary of the model, we can see that none of the identified dimensions accounts for more than 10% of the variance in the data. The first plot shows how the individuals divide along the two dimensions, while the second plot shows how the variables relate to the dimensions. We can see that most individuals fall in the middle of the plot and there are no clear outliers. This could be related to the fact that there are no strong dimensions in the data.

When we look at the variable plot, we can see some patterns in the data. Buying tea from tea shops, consuming green tea, consuming tea often and alone, and, interestingly, being a workman are found close to each other. On the other hand, being a student, consuming tea less frequently and considering tea unhealthy are also related, as are being a senior and drinking tea in the morning. Nevertheless, there are few clearly identifiable patterns in the data when we include these ten variables.